This dataset is from prosper and contains data for 113937 loans they made from 2006 to 2014. The main goal is to determine what variables set the interest paid by the borrowers (BorrowerRate)
First, we see the variables names, classes and summaries.
## 'data.frame': 113937 obs. of 10 variables:
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric.: int NA 6 NA 6 3 5 2 4 7 7 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## BorrowerRate EstimatedReturn ProsperRating..numeric.
## Min. :0.0000 Min. :-0.183 Min. :1.000
## 1st Qu.:0.1340 1st Qu.: 0.074 1st Qu.:3.000
## Median :0.1840 Median : 0.092 Median :4.000
## Mean :0.1928 Mean : 0.096 Mean :4.072
## 3rd Qu.:0.2500 3rd Qu.: 0.117 3rd Qu.:5.000
## Max. :0.4975 Max. : 0.284 Max. :7.000
## NA's :29084 NA's :29084
## EmploymentStatus CreditScoreRangeLower CreditScoreRangeUpper
## Employed :67322 Min. : 0.0 Min. : 19.0
## Full-time :26355 1st Qu.:660.0 1st Qu.:679.0
## Self-employed: 6134 Median :680.0 Median :699.0
## Not available: 5347 Mean :685.6 Mean :704.6
## Other : 3806 3rd Qu.:720.0 3rd Qu.:739.0
## : 2255 Max. :880.0 Max. :899.0
## (Other) : 2718 NA's :591 NA's :591
## DebtToIncomeRatio IncomeRange LoanOriginalAmount
## Min. : 0.000 $25,000-49,999:32192 Min. : 1000
## 1st Qu.: 0.140 $50,000-74,999:31050 1st Qu.: 4000
## Median : 0.220 $100,000+ :17337 Median : 6500
## Mean : 0.276 $75,000-99,999:16916 Mean : 8337
## 3rd Qu.: 0.320 Not displayed : 7741 3rd Qu.:12000
## Max. :10.010 $1-24,999 : 7274 Max. :35000
## NA's :8554 (Other) : 1427
## LoanOriginationDate
## 2014-01-22 00:00:00: 491
## 2013-11-13 00:00:00: 490
## 2014-02-19 00:00:00: 439
## 2013-10-16 00:00:00: 434
## 2014-01-28 00:00:00: 339
## 2013-09-24 00:00:00: 316
## (Other) :111428
I observe that I need to transform some variables. First the Prosper rating is a number and we will need it as factor, also the loan origination date is in datetime format and I proceed to extract years to study the changes over time.
I will combine Credit Score Range Upper and Lower into an Average because they measure the same variable and rearrange the order of Income Levels for a better plot visualization.
Also, I observe that a lot of variables have NAs, so in order to examine data with all of the variables I will remove rows with NAs.
ld$LoanOriginationDate.year = substr(ld$LoanOriginationDate, 1, 4)
ld$CreditScoreAverage <- (ld$CreditScoreRangeLower + ld$CreditScoreRangeUpper)/2
ld$ProsperRating..factor. <- factor(ld$ProsperRating..numeric.)
ld$IncomeRange <- factor(
ld$IncomeRange,
levels(ld$IncomeRange)[c(8,1,2,4,5,6,3,7)])
ld.no_na <- ld[complete.cases(ld),]
summary(ld.no_na)
## BorrowerRate EstimatedReturn ProsperRating..numeric.
## Min. :0.0400 Min. :-0.18160 Min. :1.000
## 1st Qu.:0.1349 1st Qu.: 0.07408 1st Qu.:3.000
## Median :0.1845 Median : 0.09060 Median :4.000
## Mean :0.1934 Mean : 0.09529 Mean :4.132
## 3rd Qu.:0.2524 3rd Qu.: 0.11500 3rd Qu.:5.000
## Max. :0.3600 Max. : 0.26670 Max. :7.000
##
## EmploymentStatus CreditScoreRangeLower CreditScoreRangeUpper
## Employed :65884 Min. :600 Min. :619
## Full-time : 7585 1st Qu.:660 1st Qu.:679
## Other : 3526 Median :700 Median :719
## Retired : 320 Mean :699 Mean :718
## Part-time : 199 3rd Qu.:720 3rd Qu.:739
## Self-employed: 42 Max. :880 Max. :899
## (Other) : 1
## DebtToIncomeRatio IncomeRange LoanOriginalAmount
## Min. : 0.0000 $50,000-74,999:24030 Min. : 1000
## 1st Qu.: 0.1500 $25,000-49,999:22023 1st Qu.: 4000
## Median : 0.2200 $100,000+ :14019 Median : 8000
## Mean : 0.2588 $75,000-99,999:13644 Mean : 9269
## 3rd Qu.: 0.3200 $1-24,999 : 3840 3rd Qu.:14500
## Max. :10.0100 Not employed : 1 Max. :35000
## (Other) : 0
## LoanOriginationDate LoanOriginationDate.year CreditScoreAverage
## 2013-11-13 00:00:00: 459 Length:77557 Min. :609.5
## 2014-01-22 00:00:00: 443 Class :character 1st Qu.:669.5
## 2013-10-16 00:00:00: 412 Mode :character Median :709.5
## 2014-02-19 00:00:00: 404 Mean :708.5
## 2014-01-28 00:00:00: 312 3rd Qu.:729.5
## 2013-09-24 00:00:00: 298 Max. :889.5
## (Other) :75229
## ProsperRating..factor.
## 1: 5739
## 2: 8617
## 3:12824
## 4:16890
## 5:14631
## 6:13705
## 7: 5151
Next, I looked at the histograms for the BorrowerRate with 3 bins sizes to determine the shape of the distribution, I selected bins of 1%, 2,5% and 0.5%.
In addition, I created a Box Plot
To get a better understanding I plot the mean, median and interquartile measures in the 1% bin size histogram.
From the previous plot we observe that the distribution is close to a normal one but has a spike at about 0.32, which moves the mean and the median from the center of the curve. This must be further investigated.
The dataset is in a wide format, with each row containing the information for a single loan, the loan and the borrower characteristics
The main feature of interest is the Borrower Rate, I want to be able to predict what rate a certain costumer would get.
In order to do this, I will take a closer look at the variables of related to the borrowers like credit score, income range, debt to income ratio, the amount of the loan and the rating prosper gives to each borrower.
Also, I created a credit average score, because it made sense to compare the data to a single credit than an upper and lower score. Also, I extracted the year from the loan origination date because I wanted to analyze the loans over the years and not a more granular time metric. Finally, I created a factor variable of the prosper ratings to be able to clearly differentiate the different borrowers in the plots.
I will use GGpairs as my first plot to get an overview of the relationship between the variables of study.
Observations:
-Borrower Rate seems to be a normal distribution with a peak on a higher rate -Correlation Between Borrower rate and other variables: High: ProsperRating..numeric. (-0.953), EstimatedReturn (0.827). Medium: CreditScoreAverage (-0.529), LoanOriginal Amount (-0.405). Low: DebtToIncomeRatio (0.126) -ProsperRating..discrete. seems to be the biggest influence in Borrower Rates, since borrower rate boxplots are higher for lower ratings. -EmploymentStatus and IncomeRange does not seems to have much influence on Borrower Rate. - There is not any clear linear relationship between BorrowerRate and other variable
Firt I created various plots to analyze the proster rating vs te borrower rate
## # A tibble: 7 x 8
## ProsperRating..factor. borrower_rate_min borrower_rate_1st_q
## <fctr> <dbl> <dbl>
## 1 1 0.1779 0.3134
## 2 2 0.1518 0.2712
## 3 3 0.1158 0.2287
## 4 4 0.0895 0.1765
## 5 5 0.0693 0.1414
## 6 6 0.0500 0.0999
## 7 7 0.0400 0.0699
## # ... with 5 more variables: borrower_rate_mean <dbl>,
## # borrower_rate_median <dbl>, borrower_rate_3rd_q <dbl>,
## # borrower_rate_max <dbl>, n <int>
From the previous plots, we can see that prosper rating is indeed a very influential variable in the borrower rate. In fact, we can see that the borrower rate mean, median, interquartile distributions, min and max decrease as the rating goes up.
Also, it is very clear that most of the borrower with rating 1 get approximately 0.31 Rate. Them costumers with rating 2 have a greater spread than those with rating 3, and this repeats up to rating 7, where the interquartile is very close to the median.
In addition, when I plotted the histograms faceted by rating we can see that all ratings but rating = 1 have a close to normal distribution.
Next, I proceed to study the influence of other variables on the Borrower Rate
The result of the previous plots clearly shows that almost all of the variables have little to no effect on the borrower rate, this includes de date, the amount, the employment status, and debt to income.
The principal factor is the prosper rating. From the client side, besides the rating only the income range, loan amount and credit score seems to have some influence on the rate, decreasing it as the conditions are more favorable.
In addition, the expected return plot is interesting, there seems to be some linear relationships, another variable might help clear this out.
Since it is clear that the main factor on the borrower rate is the prosper rating, I will try to discover other relationships between this rating and the other variables differencing each rating in the previous plots.
First, we can conclude that neither LoanOriginationDate, DebtToIncomeRatio, LoanOriginalAmount, IncomeRange nor EmploymentStatus have any clear relationship with the Borrower Rate.
Also, we can see that prosper rating is influenced by credit score average, however, the spread is big, with very different credit scores having the same rating.
In addition, we can see that over the years the rates for the ratings groups was maintained.
Lower ratings tend to have bigger debt to income rating, however this is not a rule, there are people with rating 1 and close to 0 debt to income ratio.
Lower ratings tend to get lower loan amounts, and as the loan amount increases, we see that the amounts tend to be round numbers, there are several loans made with amount of 10,000, 15,000, 20,000, 25,000, 30,000 and 35,000.
There are linear relationships for the estimated return and borrower rate when we group the data by rating, this describes that the bank calculates the higher rate the higher the return will be, a relationship closes to +1. However, we can also see that even with higher rates, the lower ratings are expected to return very low. Especially costumers with prosper rating of 1 tend to create loses. In addition, there seems to be another variable that divides that group, because multiple linear relationships within the group.
In the first Plot, I observe that the distribution of the borrower rates (interest) for the loans its similar to a normal distribution if rating 1 is omitted, because almost of the costumer with this range get about 0.32 rate. Also, the other groups seem are in order and clearly separated, best ratings obtains bests rates. This is measured by the correlation, which is -0.953, a very strong relationship
This plot is interesting because it demonstrated that better credit scores tend to get better borrower rates, however the relationship is not strong with a correlation of -0.529. Also, the fact that same credit score averages tend to have several costumers with different ratings. For example, we can find all the ratings in good amount in the 730-credit score average bin. This is very surprising for me because I thought that this will be a very influential variable on borrower rate and prosper rating. We can conclude that there are other variables influencing the rating and borrower rate.
This final plot stands out because even thus I did not find any linear relationships in the other variables, here we see that if we divide each group of ratings, we could create several linear relationships.
This describes that the bank calculates that the higher rate the return will be higher too, a relationship close to +1. However, we can also see that this is the case for the ratings besides 1, where even the higher ratings are expected to return lower.
In conclusion costumers with prosper rating 1 tend to create loses. Also, there there seems to be another variable that divides that group, because multiple linear relationships may be within the group.
Data analysis is a process where you first you have to understand each variable and the problem before doing any plots, because if you don’t understand the situation you can’t formulate the questions that will get the answers you need.
It is very important to clean up the data, because some variables might have classes or formats that will produce bad visualizations and findings. After that decide which is your main characteristic of interest, what are you trying to predict and study that variable alone, only after you have an understanding of it you can explore the relationships it has with the other variables.
Create plots for all the variables and try to find key findings that will lead your investigation until you find an explanation of the phenomenon’s discovered.
On a final note, I believe that the data is as valuable as it is real or trustworthy. Even thus is it from a great source the analyst must be aware of the limitations of the data due to how was is recollected, from who and under what circumstances.